ABSTRACT

We develop a method to measure the probability, P(N; M), of finding N galaxies in
a dark-matter halo of mass M from the theoretically determined clustering properties
of dark-matter halos and the observationally measured clustering properties of
galaxies. Knowledge of this function and the distribution of the dark matter completely
specifies all clustering properties of galaxies on scales larger than the size of
dark-matter halos. Furthermore, P(N; M) provides strong constraints on models of
galaxy formation, since it depends upon the merger history of dark-matter halos and
the galaxy-galaxy merger rate within halos. We show that measurements from a combination
of the 2MASS and SDSS or 2dFGRS datasets will allow P(N; M) averaged
over halos occupied by bright galaxies to be accurately measured for N = 0-2.



1 INTRODUCTION 

Recent work on the clustering properties of galaxies has focussed on the connection between galaxies and dark matter halos, 
using theoretical models of galaxy formation (Kauffmann, Nusser & Steinmetz 1997; Kauffmann et al. 1999a, b; Diaferio et 
al. 1999; Benson et al. 2000a,b; Somerville et al. 2000; Seljak 2000) or observational data (Peacock & Smith 2000) to 
determine the number of galaxies that reside within halos of given mass. Models of this type have been successful in explaining 
the near power-law nature of the galaxy-galaxy correlation function (Kauffmann et al. 1999a; Benson et al. 2000a), and the 
strong clustering of Lyman-break galaxies at z ~ 3 (Governato et al. 1998; Baugh et al. 1998; Wechsler et al. 2000).

In this context, Benson et al. (2000a) calculated the quantity P(N;M), the probability of finding N galaxies brighter 
than a specified luminosity, L_0, in a dark matter halo of mass M, from the galaxy-formation model of Cole et al. (2000).
This quantity is particularly powerful since, once a model for the distribution of dark-matter halos is chosen, P(N; M) fully 
determines all clustering properties of galaxies on scales larger than the size of dark matter halos (on smaller scales the spatial 
distribution of galaxies within individual halos becomes important). As such, P(N;M) may be thought of as a complete 
description of the galaxy-dark-matter bias, including any non-linearity and stochasticity. Furthermore, if P(N; M) can be
measured from the observed clustering pattern of galaxies it provides a direct and powerful constraint for models of galaxy 
formation since it is sensitive to the merger history of dark-matter halos, and to the rate of galaxy-galaxy mergers within 
halos. 

In this paper we describe how P(N; M) may be measured directly from a volume-limited galaxy-redshift survey by using 
a counts-in-cells analysis to determine the probability of finding N galaxies in a cell, S(N). The remainder of this paper is laid
out as follows. In §2 we describe our method and give the formulae relating P(N; M) and S(N) for all N. In §3 we investigate
how well P(N; M) can be measured from a combination of the Two Micron All Sky Survey (Skrutskie et al. 1995; 2MASS)
and Two-degree Field Galaxy Redshift Survey (Dalton 2000; 2dFGRS) or Sloan Digital Sky Survey (Blanton et al. 2000;
SDSS) datasets using the mock galaxy catalogues of Benson et al. (2000a), and finally in §4 we present our conclusions.



2 METHOD 

We will assume that the galaxy population of a dark matter halo is determined only by the mass of that halo. Whilst it is the 
distribution of halo masses which varies most significantly as a function of environment (Lemson & Kauffmann 1999) other 
quantities are also known to correlate with environment, for example the concentration of the halo (Bullock et al. 2000). In 
practice the properties of galaxies may depend upon such variables, thereby altering the clustering properties of the galaxies.
In principle, other variables could be included in our analysis by defining a function P(N;M, x), where x represents other 
variables upon which the properties of galaxies may depend. However, current datasets are insufficient to allow meaningful 
measurements of such a function to be made and so we will restrict ourselves to considering P(N; M) only at present. 

We must also assume that P(0; M) = 1 for all M < M_0; i.e. halos below mass M_0 never contain any galaxies brighter
than L_0. This is a reasonable and necessary assumption: if halos of arbitrarily low mass could host bright galaxies then,
since there are an infinite number of halos per unit volume (at least according to the Press-Schechter theory), there would be
an infinite number of galaxies per unit volume. Having made this assumption we can ignore halos of mass less than M_0 as
they make no contribution to the galaxy population that we are considering.

How may we determine M_0 for a given galaxy population? One approach would be to make use of dynamical mass
estimates (e.g. Vogt et al. 1997). However, these are not available for all types of galaxy. An alternative method is to use
galaxy samples selected at near-infrared wavelengths from which we can infer a stellar mass from the sample magnitude limit
(Kauffmann & Charlot 1998). Then

    M_0 > \frac{\Omega_0}{\Omega_{\rm b}} M_*,    (1)

where M_* is the stellar mass. This lower limit on M_0 corresponds to the case where the entire gaseous mass of a halo is turned
into stars. The halo must have at least this mass to make the observed galaxy. The conversion from K-band light to stellar
mass is uncertain by a factor of approximately two (Brinchmann & Ellis 2000), so M_0 should realistically be taken to be two
to three times lower than the value inferred from eqn. (1).

While any clustering statistic can be written in terms of P(N; M) and the clustering properties of dark matter halos
(for example, the two-point correlation function expressed in terms of P(N; M) is given in Appendix A), a particularly simple
relation can be found for S(N), the probability of finding N galaxies brighter than L_0 in a cell of given size and shape. While
these statistics can in principle reveal P(N; M) for any N and for a range of M, in practice measurement is severely limited by
unavoidable noise in the data, as will be shown in §3. Nevertheless, useful constraints can still be obtained from this analysis.
In the remainder of this section we develop the relations necessary to determine P(N; M) for all N and M, but will only
make use of the simplest forms of these relations in §3.

The probability of finding N galaxies in a cell of given size and geometry can be expressed in terms of the probability 
of finding a certain combination of halos in that cell and the probabilities of finding different numbers of galaxies in each of 
those halos. In order to measure P(N; M) it is necessary to divide halos into a number of mass ranges, or bins. We will then
refer to the mean value of P(N; M) averaged over all halos in mass bin i as P_i(N), such that P_i(N) is the probability of
finding N galaxies in a halo selected at random from mass bin i, i.e.

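    P_i(N) = \frac{\int_{M_i}^{M_{i+1}} P(N; M) \frac{{\rm d}n}{{\rm d}M}(M)\, {\rm d}M}{\int_{M_i}^{M_{i+1}} \frac{{\rm d}n}{{\rm d}M}(M)\, {\rm d}M},    (2)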

where M_i is the lower bound of the ith mass bin.
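
As a rough illustration of eqn. (2), the sketch below averages P(N; M) over a single mass bin, weighting by the halo mass function dn/dM. The power-law mass function, the form chosen for P(0; M), the bin edges and the names bin_average, toy_dn_dm and toy_p0 are all assumptions made for this example only; none of them is taken from the paper or from a simulation.

```python
import math

def bin_average(p_of_m, dn_dm, m_lo, m_hi, steps=2000):
    """Mass-function-weighted mean of p_of_m(M) over [m_lo, m_hi].

    Uses the trapezoid rule in ln M, since halo masses span several decades.
    """
    x_lo, x_hi = math.log(m_lo), math.log(m_hi)
    dx = (x_hi - x_lo) / steps
    num = den = 0.0
    for k in range(steps + 1):
        m = math.exp(x_lo + k * dx)
        end_weight = 0.5 if k in (0, steps) else 1.0
        w = end_weight * dn_dm(m) * m * dx  # dM = M dlnM
        num += w * p_of_m(m)
        den += w
    return num / den

def toy_dn_dm(m):
    """Hypothetical power-law mass function (arbitrary normalisation)."""
    return m ** -2.0

def toy_p0(m, m0=4e11):
    """Hypothetical P(0; M): equal to 1 at M_0 and falling as 1/M above it."""
    return m0 / m

# Bin-averaged P_i(0) for one bin running from 4e11 to 4e13 (toy units).
p0_bin = bin_average(toy_p0, toy_dn_dm, 4e11, 4e13)
```

Because the weight falls steeply with mass, the average is dominated by halos near the lower edge of the bin, which is one reason a single wide bin may be a poor summary of P(N; M).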

Let S(N) be the probability of finding N galaxies in a cell (of given size and geometry). For a particular choice of
cosmology and dark matter let Q(N_1, N_2, ..., N_n) be the probability of finding the centres of N_1 halos in mass bin 1, N_2 in
bin 2 etc. in a cell, where we have used a total of n mass bins. (We take the centre of mass to define the halo centre.) Note
that in general Q(N_1, N_2, ..., N_n) ≠ Q(N_1)Q(N_2)...Q(N_n) since the distribution of halos is typically correlated. Note that
S(N) is an observationally measurable quantity, and Q(N_1, N_2, ..., N_n) can be obtained from a structure formation model.
As we show below, these two quantities are related, and that relation depends upon P(N; M). Measurement of S(N) therefore
allows us to measure P(N; M).

We can write S(N) as the sum over all possible combinations of N_1, N_2, ..., N_n of Q(N_1, N_2, ..., N_n) multiplied by the
probability of finding i_1 galaxies in the first halo, i_2 in the second etc., summed over all combinations of i_1, i_2, ... which satisfy
the constraint \sum_j i_j = N (i.e. only those combinations which produce the correct number of galaxies in the cell contribute
to the total probability).
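
The sum just described can be written out directly for a small case. The following is a minimal sketch, assuming two mass bins and assuming that each halo is populated independently of every other halo, so that the galaxy count of a set of halos is a convolution of the individual P_i(N). The function names and all numerical values of Q(N_1, N_2) and P_i(N) are illustrative assumptions, not values from the paper.

```python
def convolve(a, b, n_max):
    """Distribution of the sum of two independent counts, truncated at n_max."""
    out = [0.0] * (n_max + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= n_max:
                out[i + j] += ai * bj
    return out

def galaxies_in_halos(p, n_halos, n_max):
    """Distribution of the total galaxy count in n_halos halos drawn from p."""
    total = [1.0] + [0.0] * n_max  # zero halos always hold zero galaxies
    for _ in range(n_halos):
        total = convolve(total, p, n_max)
    return total

def counts_in_cells(q, p_bins, n_max):
    """S(N) for N = 0..n_max from Q(N_1, N_2) and the P_i(N) of two mass bins.

    q maps (N_1, N_2) to a probability; p_bins = [P_1, P_2], each a list over N.
    """
    s = [0.0] * (n_max + 1)
    for (n1, n2), prob in q.items():
        d1 = galaxies_in_halos(p_bins[0], n1, n_max)
        d2 = galaxies_in_halos(p_bins[1], n2, n_max)
        both = convolve(d1, d2, n_max)
        for n in range(n_max + 1):
            s[n] += prob * both[n]
    return s

# Toy inputs (assumptions): correlated halo counts and two occupation distributions.
q = {(0, 0): 0.5, (1, 0): 0.3, (0, 1): 0.05, (1, 1): 0.15}
p_bins = [[0.8, 0.2], [0.3, 0.5, 0.2]]
s = counts_in_cells(q, p_bins, n_max=3)
```

With these toy numbers the N = 0 entry reduces to the sum of Q(N_1, N_2) P_1(0)^{N_1} P_2(0)^{N_2}, which is the structure of the S(0) expression given next.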

For example, S(0) is given by 

    S(0) = \sum_{N_1=0}^{\infty} \sum_{N_2=0}^{\infty} \cdots \sum_{N_n=0}^{\infty} Q(N_1, N_2, \ldots, N_n) \prod_{j=1}^{n} P_j^{N_j}(0),    (3)

while S(1) and S(2) are given by
where N_i^{(j)} is the number of times a halo in mass bin i is populated by j galaxies and C(N_1^{(0)}, ...) is the number of
distinct permutations of each term which contribute to the probability. The weighting factor C(N_1^{(0)}, N_1^{(1)}, ...) is the number
of ways to populate the available halos with the galaxies divided by the number of times such terms appear in the summation.
In general,
where N_i = \sum_{j=0}^{\infty} N_i^{(j)}. For example
As expected, S(N) depends only upon those P_i(j) for which j ≤ N. Therefore, we may begin by finding the P_i(0)'s using
the expression for S(0), then proceed to find the P_i(1)'s using the expression for S(1) and the previously calculated P_i(0),
and so on. Each expression therefore involves n unknowns (for S(N) these are the P_i(N)), and so we must have a measure
of S(N) for at least n different cell sizes to solve the equations. While the above equations cannot be solved analytically
for the P_i(N), solutions can be found relatively simply using Powell's method (Press et al. 1992) to minimize the quantity
\chi^2 = \sum_i ([S_i^{(obs)}(N) - S_i^{(model)}(N)] / \Delta S_i^{(obs)}(N))^2, for example, where the sum is taken over all cell sizes considered.
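
The fitting step can be sketched for the simplest case of a single mass bin and N = 0, where the only unknown is P_1(0). The text uses Powell's method; this example swaps in a plain golden-section search, which is enough for one unknown. The Poisson-distributed halo counts standing in for Q(N), the error of 0.01 on every S(0), and the "true" value 0.7 are assumptions chosen only to show that the minimisation recovers the input.

```python
import math

def s0_model(q, p0):
    """S(0) for a single mass bin: sum over N of Q(N) * P_1(0)**N."""
    return sum(qn * p0 ** n for n, qn in enumerate(q))

def chi2(p0, cells):
    """cells holds one (Q list, observed S(0), error on S(0)) per cell size."""
    return sum(((s_obs - s0_model(q, p0)) / err) ** 2 for q, s_obs, err in cells)

def golden_min(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c  # minimum lies in [a, old d]
            c = b - g * (b - a)
        else:
            a, c = c, d  # minimum lies in [old c, b]
            d = a + g * (b - a)
    return 0.5 * (a + b)

def poisson(mean, n_max=60):
    """Toy stand-in for Q(N): Poisson halo counts with the given mean."""
    return [math.exp(-mean) * mean ** n / math.factorial(n) for n in range(n_max + 1)]

# Build noiseless "observations" for four hypothetical cell sizes.
true_p0 = 0.7
cells = []
for mean_halos in (0.2, 0.7, 1.6, 3.1):
    q = poisson(mean_halos)
    cells.append((q, s0_model(q, true_p0), 0.01))

best_p0 = golden_min(lambda p: chi2(p, cells), 0.0, 1.0)
```

With several mass bins there are n unknowns per value of N and a multi-dimensional minimiser such as Powell's method is needed, as the text describes.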



In general, the expression for S(N) will be of the form 



where the expression "N halo terms" refers to all terms corresponding to galaxies shared between N different halos (i.e. the 
first two sums in the above expression are therefore "1 halo terms" and "2 halo terms"). 

At this point it is instructive to briefly consider the assumptions made in obtaining the above relations. Firstly we have 
assumed that all galaxies lie at the centre of the halo they occupy. Then, a halo being in a cell guarantees that any galaxies 
it contains are also in the cell. In reality galaxies are likely to be spread throughout the halo with some unknown spatial 
distribution, and so some galaxies may lie outside of the cell even though their halo centre is inside (and conversely some 
galaxies may lie inside even though their halo centre is outside). While our analysis could be extended to account for such 
"edge effects" this would require us to assume a distribution for galaxies within individual halos. We prefer to concentrate on 
scales where these effects are negligible. In §3 we demonstrate that edge effects are an insignificant source of error.

Secondly we assume that the galaxy occupancy of all halos in a mass bin is well described by a single set of P_i(N).
Providing P(N; M) varies little across the mass bin this is a reasonable assumption. However, as we will see in §3, noisy data
may limit us to considering a single mass bin, extending from M_0 to infinity, for which the above assumption is unlikely to hold
true. While we implicitly assume that the P_i(N) are independent of the number of halos found in a cell, the Press-Schechter
(Press & Schechter 1974) formalism tells us that high density regions of the Universe will contain preferentially higher mass
halos than low density regions. Consequently cells which contain many halos will preferentially contain high mass halos, while 
in cells containing few halos the halos are likely to be of low mass. If, for example, P(0; M) is a decreasing function of M then 
cells with few halos (which are typically the most abundant) will contain zero galaxies more often than our model assumes. 
The resulting increase in S(0) can be seen in the synthetic datasets used in §3. While this has a non-negligible effect on S(0),
particularly for large cell sizes, the value of P(0) recovered is quite insensitive to this since most of the signal comes from 
small cell sizes. 



3 APPLICATION TO SYNTHETIC DATASETS 

Perhaps the most suitable dataset to apply this technique to will be a combination of the 2MASS survey with a large 
redshift survey (e.g. the 2dFGRS or the SDSS). The 2MASS survey provides near-infrared photometry which allows M_0 to
be estimated, but must be complemented by a redshift survey in order to provide a 3D map of the galaxy distribution.* A
volume limited 2MASS sample of galaxies brighter than M_K - 5 log h = -23.5 would have a volume of order 3 × 10^6 h^{-3} Mpc^3
in the 2dFGRS survey area (or around four times this volume in the SDSS survey area). As this is very similar to the volume
of the GIF ΛCDM N-body simulation used by Benson et al. (2000a) we will use their synthetic galaxy catalogues to estimate
how well P(N;M) could be recovered from such a dataset. We do not attempt here to reproduce the full details of the 
survey geometry or selection function, but merely consider a synthetic dataset with comparable volume and number density 
of galaxies in order to estimate the accuracy with which P(N; M) may be recovered from such a survey. 

We consider only galaxies brighter than M_K - 5 log h = -23.5 to ensure that we need only consider halos which are well
resolved by the GIF simulation. These galaxies live in halos with masses greater than 10^12 h^{-1} M_⊙ in this model (the particle
mass in the GIF ΛCDM simulation is 1.4 × 10^10 h^{-1} M_⊙). Inferring M_* from the K-band magnitude of the galaxies we find
M_0 = 8 × 10^11 h^{-1} M_⊙. We therefore conservatively set M_0 = 4 × 10^11 h^{-1} M_⊙. We consider only one bin of halo mass, i.e. all
halos more massive than 4 × 10^11 h^{-1} M_⊙. While this technique can in principle be applied to several halo mass bins we find
that in practice this is very difficult. Typically the values of P_i(N) for the more massive bins are poorly constrained since
there are very few halos in the mass range, or else the solutions of the equations for S(N) for different cell sizes are degenerate in the P_i(N)'s
and so only allow certain combinations of P_i(N)'s to be accurately measured. Very large datasets, with correspondingly small
errors, may allow a measurement of P_i(N) in more than one mass bin, although this will probably require a treatment of edge
effects which must eventually become the dominant source of error.

We measure S(N) in cubic cells of side L_cube = 5.0, 7.5, 10.0, 12.5, 15.0, 17.5 and 20.0 h^{-1} Mpc, and measure Q(N) for
the same cell sizes. Both S(N) and Q(N) are calculated for galaxy/halo positions in redshift space. For smaller cubes edge 
effects begin to become a significant source of error, while for larger cubes the GIF simulation contains very few independent 
volumes. 

The left-hand panel of Fig. 1 shows Q(N) for L_cube = 5, 10 and 20 h^{-1} Mpc, while the right-hand panel shows S(0) (squares)
and Q(0) (crosses) as functions of L_cube. Errors are estimated assuming Poisson statistics and that there are (L_GIF/L_cube)^3
independent volumes in the simulation, where L_GIF = 141.3 h^{-1} Mpc is the size of the GIF ΛCDM simulation volume. This
is known to underestimate the true errors (e.g. Kim & Strauss 1998), but is sufficient for our present purposes. Note that
placing all galaxies at the halo centre (solid squares), or placing one galaxy at the centre and making satellite galaxies trace
the dark matter of their halo (open squares) has little effect on the measured S(0), i.e. edge effects are unimportant for this
sample. For the smallest cells we consider, Q(0) accounts for around 65% of the value of S(0), and makes a smaller contribution
for the larger cells. Also shown is the value of S(0) predicted by eqn. (3) with the recovered value of P_1(0) (dashed line) and
the true value of P_1(0) (solid line). For the larger cell sizes neither gives a good fit to the mock data points. This is due to
the failure of our assumption that P(N; M) is roughly constant throughout the mass bin (as discussed in §2). However, as
we discuss below this does not drastically alter the recovered values of P_1(N).
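
The error estimate described above can be made concrete with a short sketch. It takes the box size of 141.3 h^{-1} Mpc from the text and treats the number of cells found in a given state as a Poisson variable over (L_GIF/L_cube)^3 independent cells. The function name poisson_error and the trial probability of 0.5 are assumptions for illustration; the exact estimator used for the figures is not spelled out in the text.

```python
import math

L_GIF = 141.3  # side of the simulation volume in h^-1 Mpc (value from the text)

def independent_cells(l_cube, l_box=L_GIF):
    """Number of independent cubic cells of side l_cube in the box."""
    return (l_box / l_cube) ** 3

def poisson_error(prob, l_cube, l_box=L_GIF):
    """Absolute error on a cell probability, assuming Poisson counts.

    The expected count is prob * n_cells, so the fractional error is
    1 / sqrt(prob * n_cells) and the absolute error is sqrt(prob / n_cells).
    """
    return math.sqrt(prob / independent_cells(l_cube, l_box))

# Errors on a hypothetical probability of 0.5 for three of the cell sizes used.
errors = {l_cube: poisson_error(0.5, l_cube) for l_cube in (5.0, 10.0, 20.0)}
```

Going from 5 to 20 h^{-1} Mpc cells reduces the number of independent volumes by a factor of 64 and so raises the error on a fixed probability by a factor of 8, which is why the largest cells are the most weakly constrained.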

We determine P_1(N) from the measured S(N) and Q(N) by solving the expression for S(N) for P_1(N) by minimizing χ^2 (as described
in §2). Figure 2 shows the true P_1(N) as measured directly from the full model and from the GIF synthetic galaxy catalogue
(solid squares and solid triangles respectively), with errorbars computed assuming Poisson statistics, and the P_1(N) recovered
from the synthetic galaxy catalogue via the S(N)'s with all galaxies at their halo centre (open triangles) and with satellite
galaxies tracing the dark matter of their halo (open squares), with errorbars estimated from Δχ^2. The first three P_1(N) are
recovered with reasonable accuracy from the synthetic galaxy catalogues. (For N = 0, 1 and 2 the recovered P_1(N) differ
from the true values by 3%, 17% and 46% respectively, although we caution that these values are from a single realization of
the synthetic galaxy catalogue and so may not be representative.)

* While P(N; M) could be measured from a 2D dataset, the 3D information will provide a much stronger constraint.

Figure 1. Left-hand panel: The probability of finding N halos more massive than 4 × 10^11 h^{-1} M_⊙ in cubes of sides 5 (solid line),
10 (dashed line) and 20 h^{-1} Mpc (dot-dashed line) in redshift-space. Dotted lines indicate errors on these quantities assuming Poisson
statistics. The inset shows an expanded view of the low-N region. Right-hand panel: The probability of finding zero galaxies brighter
than M_K - 5 log h = -23.5 in cubic cells of side L_cube in redshift-space in the simulations of Benson et al. (2000a). Solid squares show
the result when all galaxies are placed at the centres of dark matter halos, while open squares indicate the result when satellite galaxies
are made to trace the dark matter in their halo. Errors are calculated assuming Poisson statistics. The solid line shows S(0) calculated
from the measured Q(N) (as shown in the left-hand panel) and the value of P(0) measured directly from the models of Benson et al.
(2000a). Crosses with errorbars show the contribution of Q(0) to S(0). The Q(0) contribution is around 65% for L_cube = 5 h^{-1} Mpc, and
falls for larger values of L_cube.

A weakness of this approach is that the equation for P_1(N) depends upon all P_1(N') where N' < N. Hence, any errors
in the determination of P_1(0) affect the estimate of P_1(1) etc. In the case of the synthetic galaxy catalogues used here we can
recover P_1(N) accurately for N = 0, 1, 2. When we consider P_1(3), however, we find that the contribution to S(3) from terms
involving only P_1(0), P_1(1) and P_1(2) already exceeds the measured value. Thus the solution to the equation requires that
P_1(3) be negative, which is of course impossible. Thus with a dataset of this size only the first few P_1(N) can be measured.
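
The way errors propagate to a negative P_1(3) can be seen by writing the single-bin relation so that the unknown appears explicitly. This is a sketch for one mass bin, one cell size and N ≥ 1, assuming halos are populated independently. The symbol R_N(N_1) is introduced here only for illustration: it collects every way of sharing N galaxies among N_1 halos with fewer than N galaxies in each, so it involves only P_1(0), ..., P_1(N-1).

```latex
% The only configurations containing P_1(N) put all N galaxies in one of
% the N_1 halos and none in the others, giving N_1 equivalent terms.
S(N) = P_1(N) \sum_{N_1=1}^{\infty} Q(N_1)\, N_1\, P_1^{N_1-1}(0)
       + \sum_{N_1=0}^{\infty} Q(N_1)\, R_N(N_1)

% S(N) is therefore linear in the unknown P_1(N):
P_1(N) = \frac{S(N) - \sum_{N_1=0}^{\infty} Q(N_1)\, R_N(N_1)}
              {\sum_{N_1=1}^{\infty} Q(N_1)\, N_1\, P_1^{N_1-1}(0)}
```

The denominator is positive, so the sign of P_1(N) is set by the numerator. If errors in the previously fitted P_1(0), ..., P_1(N-1) push the second sum above the measured S(N), the formal solution is negative, which is the situation described above for N = 3.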



4 DISCUSSION 

We have described how the distribution of galaxies amongst halos, as described by the function P(N; M) (the probability 
of finding N galaxies brighter than a specified luminosity L_0 in a halo of mass M), can be measured directly from a galaxy
redshift survey once a model for the spatial distribution of dark matter halos is assumed. Specifically we derive relations
between the observationally measurable quantity S(N) (the probability of finding N galaxies in a cell) and the theoretically
determinable quantity Q(N_1, N_2, ..., N_n) (the probability of finding different numbers of dark-matter halos in a cell). These
relations depend upon P(N; M), thereby allowing P(N; M) to be determined from observational determinations of S(N) and 
a model of structure formation. 

The distribution function P(N; M) provides a complete description of galaxy bias (at least on scales larger than the size 
of halos) in terms of physically meaningful quantities and will also be sensitive to the merging history of dark matter halos 
and the rate of galaxy-galaxy mergers within dark matter halos. We have presented the technique in its simplest form. We 
defer a more detailed study of errors (including edge effects) and the limitations imposed by the simplifying assumptions made 
to a future paper. 

Our approach assumes a model for the underlying distribution of dark matter halos, and the results obtained will 
therefore be dependent on that model. Measurements of key cosmological parameters, perhaps from measurements of the 
cosmic microwave background (Jungman et al. 1996; Bond, Efstathiou & Tegmark 1997), and the dark matter power
spectrum, from weak lensing (e.g. Tyson, Wittman & Angel 2000) or Lyman-α forest studies (Croft et al. 1998), in the near
future should allow the halo distribution to be fully determined.

Figure 2. P_1(N) for halos more massive than 4 × 10^11 h^{-1} M_⊙ and galaxies brighter than M_K - 5 log h = -23.5. Solid squares show
P_1(N) taken directly from the full model of Benson et al. (2000a), while solid triangles show that taken directly from the GIF synthetic
galaxy catalogue of Benson et al. (2000a). Open triangles show the P_1(N) recovered from the GIF synthetic galaxy catalogue via
determinations of S(N) when all galaxies are placed at the centre of their halo (error bars are computed assuming Poisson statistics),
while open squares show the result when satellite galaxies are made to trace the dark matter content of their halo (errorbars computed
from Δχ^2). The points are offset slightly in N for clarity.

Using the mock galaxy catalogues produced by Benson et al. (2000a) we have shown that P(N; M) averaged over all 
halos more massive than 4 × 10^11 h^{-1} M_⊙ is measurable for the first few values of N from a combination of the 2MASS dataset
with a redshift survey such as the SDSS or 2dFGRS. To measure P(N; M) for higher N or as a function of M would require
larger datasets and a detailed consideration of edge effects. Measurement of this quantity from forthcoming galaxy redshift 
surveys will therefore provide strong constraints for models of galaxy formation and clustering and reveal a great deal about 
the connection between galaxies and dark matter. 
APPENDIX A: THE TWO-POINT CORRELATION FUNCTION OF GALAXIES

The two-point correlation function is a familiar clustering statistic that is easily expressed in terms of P(N; M). Suppose there is a
halo of mass M_1 to M_1 + dM_1 in a small volume element dV_1. Let dQ_12 be the probability of finding a halo of mass M_2 to
M_2 + dM_2 in a small volume element dV_2 a distance r away from the first halo. We can write

    {\rm d}Q_{12}(r) = [1 + \xi_{12}(r)] \frac{{\rm d}n}{{\rm d}M}(M_1) \frac{{\rm d}n}{{\rm d}M}(M_2)\, {\rm d}M_1 {\rm d}M_2 {\rm d}V_1 {\rm d}V_2,    (A1)

where ξ_12(r) is the cross correlation function of these halos. A single halo pair may contribute many galaxy pairs. On average
the above halo pair will contribute 

    {\rm d}N_{12}(r) = N(M_1) N(M_2)\, {\rm d}Q_{12}(r)    (A2)

galaxy pairs, where N(M) = \sum_{N'=0}^{\infty} N' P(N'; M) is the mean number of galaxies in a halo of mass M. For a random distribution
of galaxies we would expect

    {\rm d}N_{12}^{\rm R}(r) = N(M_1) N(M_2) \frac{{\rm d}n}{{\rm d}M}(M_1) \frac{{\rm d}n}{{\rm d}M}(M_2)\, {\rm d}M_1 {\rm d}M_2 {\rm d}V_1 {\rm d}V_2.    (A3)



Integrating eqns. (A2) and (A3) over all halo masses we find the total number of galaxy pairs in the clustered and random
cases to be 

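    {\rm d}N_{\rm gg}(r) = \int_{M_0}^{\infty} \int_{M_0}^{\infty} N(M_1) N(M_2) [1 + \xi_{12}(r)] \frac{{\rm d}n}{{\rm d}M}(M_1) \frac{{\rm d}n}{{\rm d}M}(M_2)\, {\rm d}M_1 {\rm d}M_2 {\rm d}V_1 {\rm d}V_2    (A4)

    {\rm d}N_{\rm gg}^{\rm R}(r) = \int_{M_0}^{\infty} \int_{M_0}^{\infty} N(M_1) N(M_2) \frac{{\rm d}n}{{\rm d}M}(M_1) \frac{{\rm d}n}{{\rm d}M}(M_2)\, {\rm d}M_1 {\rm d}M_2 {\rm d}V_1 {\rm d}V_2 = n_{\rm gal}^2\, {\rm d}V_1 {\rm d}V_2,    (A5)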

where n_gal is the mean number density of the galaxies. The galaxy-galaxy correlation function is defined to be

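    \xi_{\rm gg}(r) = \frac{{\rm d}N_{\rm gg}(r)}{{\rm d}N_{\rm gg}^{\rm R}(r)} - 1 = \frac{1}{n_{\rm gal}^2} \int_{M_0}^{\infty} \int_{M_0}^{\infty} \xi_{12}(r) N(M_1) N(M_2) \frac{{\rm d}n}{{\rm d}M}(M_1) \frac{{\rm d}n}{{\rm d}M}(M_2)\, {\rm d}M_1 {\rm d}M_2.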
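
The appendix result can be checked numerically by replacing the mass integrals with sums over a few discrete halo classes. The sketch below assumes three classes with hypothetical number densities, mean occupancies and linear bias factors, and assumes the halo cross-correlation factorises as ξ_ab = b_a b_b ξ_m. That factorisation and every number used are illustrative assumptions, not part of the derivation above. As in the appendix, only galaxy pairs in different halos are counted, so the result applies on scales larger than individual halos.

```python
def galaxy_xi(xi_halo, n_halo, n_gal_mean):
    """Galaxy correlation function from discrete halo classes.

    xi_halo[a][b]: halo cross-correlation at the chosen separation,
    n_halo[a]: halo number density of class a,
    n_gal_mean[a]: mean number of galaxies per halo of class a.
    """
    # Galaxy number density contributed by each halo class.
    weights = [n * g for n, g in zip(n_halo, n_gal_mean)]
    n_gal = sum(weights)
    size = len(weights)
    pairs = sum(xi_halo[a][b] * weights[a] * weights[b]
                for a in range(size) for b in range(size))
    return pairs / n_gal ** 2

# Toy inputs (all values are assumptions for illustration).
bias = [0.8, 1.5, 3.0]          # hypothetical linear bias of each class
xi_m = 0.4                      # hypothetical matter correlation at this r
xi_halo = [[ba * bb * xi_m for bb in bias] for ba in bias]
n_halo = [1e-2, 1e-3, 1e-5]
n_gal_mean = [0.3, 1.2, 8.0]

xi_gg = galaxy_xi(xi_halo, n_halo, n_gal_mean)

# With factorised halo correlations the galaxies acquire a galaxy-weighted
# mean bias, so xi_gg should equal b_eff**2 * xi_m.
b_eff = (sum(b * n * g for b, n, g in zip(bias, n_halo, n_gal_mean))
         / sum(n * g for n, g in zip(n_halo, n_gal_mean)))
```

In this factorised case the double sum collapses to the square of a galaxy-weighted mean bias, which gives a quick consistency check on any implementation of the full integral.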